广泛观察到的神经缩放定律,其中错误是训练集大小,模型大小或两者兼而有之的误差,从而促进了深度学习的实质性改进。但是,仅通过缩放来进行这些改进就需要计算和能源成本相当大。在这里,我们专注于数据集大小的错误缩放,并展示在理论和实践中如何超越幂律的扩展,并将其减少到指数缩放,如果我们可以访问高质量的数据修剪指标,以将顺序排名为应该丢弃哪些培训示例以实现任何修剪的数据集大小。然后,我们通过经验修剪的数据集大小来测试这一新的指数缩放预测,并且实际上观察到了在CIFAR-10,SVHN和Imagenet训练的重新NET上的功率定律缩放性能。鉴于找到高质量的修剪指标的重要性,我们对ImageNet上十个不同的数据修剪指标进行了第一个大规模的基准测试研究。我们发现大多数现有的高性能指标尺寸较差,而对于ImageNet来说,最佳尺度是计算密集型的,并且需要为每个图像标签。因此,我们开发了一种新的简单,便宜和可扩展的自我监督的修剪指标,该指标与最佳监督指标相当。总体而言,我们的工作表明,发现良好的数据指标可能会为可行的途径提供可行的途径,从而大大改善神经缩放法律,从而降低现代深度学习的资源成本。
translated by 谷歌翻译
在这项工作中,我们探讨了随机梯度下降(SGD)训练的深神经网络的限制动态。如前所述,长时间的性能融合,网络继续通过参数空间通过一个异常扩散的过程,其中距离在具有非活动指数的梯度更新的数量中增加距离。我们揭示了优化的超公数,梯度噪声结构之间的复杂相互作用,以及在训练结束时解释这种异常扩散的Hessian矩阵。为了构建这种理解,我们首先为SGD推导出一个连续时间模型,具有有限的学习速率和批量尺寸,作为欠下的Langevin方程。我们在线性回归中研究了这个方程,我们可以为参数的相位空间动态和它们的瞬时速度来得出精确的分析表达式,从初始化到实用性。使用Fokker-Planck方程,我们表明驾驶这些动态的关键成分不是原始的训练损失,而是修改的损失的组合,其隐含地规则地规范速度和概率电流,这导致相位空间中的振荡。我们在ImageNet培训的Reset-18模型的动态中确定了这种理论的定性和定量预测。通过统计物理的镜头,我们揭示了SGD培训的深神经网络的异常限制动态的机制来源。
translated by 谷歌翻译
While deep learning has led to remarkable advances across diverse applications, it struggles in domains where the data distribution changes over the course of learning. In stark contrast, biological neural networks continually adapt to changing domains, possibly by leveraging complex molecular machinery to solve many tasks simultaneously. In this study, we introduce intelligent synapses that bring some of this biological complexity into artificial neural networks. Each synapse accumulates task relevant information over time, and exploits this information to rapidly store new memories without forgetting old ones. We evaluate our approach on continual learning of classification tasks, and show that it dramatically reduces forgetting while maintaining computational efficiency.
translated by 谷歌翻译
A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable. Here, we develop an approach that simultaneously achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model. We additionally release an open source reference implementation of the algorithm.1. extreme flexibility in model structure, 2. exact sampling,
translated by 谷歌翻译
A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance. This work extends the results of .
translated by 谷歌翻译
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We
translated by 谷歌翻译
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
translated by 谷歌翻译
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
translated by 谷歌翻译
This paper revisits the work of Rauch et al. (1965) and develops a novel method for recursive maximum likelihood particle filtering for general state-space models. The new method is based on statistical analysis of incomplete observations of the systems. Score function and conditional observed information of the incomplete observations/data are introduced and their distributional properties are discussed. Some identities concerning the score function and information matrices of the incomplete data are derived. Maximum likelihood estimation of state-vector is presented in terms of the score function and observed information matrices. In particular, to deal with nonlinear state-space, a sequential Monte Carlo method is developed. It is given recursively by an EM-gradient-particle filtering which extends the work of Lange (1995) for state estimation. To derive covariance matrix of state-estimation errors, an explicit form of observed information matrix is proposed. It extends Louis (1982) general formula for the same matrix to state-vector estimation. Under (Neumann) boundary conditions of state transition probability distribution, the inverse of this matrix coincides with the Cramer-Rao lower bound on the covariance matrix of estimation errors of unbiased state-estimator. In the case of linear models, the method shows that the Kalman filter is a fully efficient state estimator whose covariance matrix of estimation error coincides with the Cramer-Rao lower bound. Some numerical examples are discussed to exemplify the main results.
translated by 谷歌翻译
“感应头”是注意力头,它实现了一种简单的算法来完成令牌序列,例如[a] [b] ... [a] - > [b]。在这项工作中,我们提供了一个假设的初步和间接证据,即诱导头可能构成大型大型变压器模型中所有“文本学习”中大多数的机制(即减少在增加代币指数时损失的损失)。我们发现,诱导头在与秘密学习能力突然急剧上的急剧上升的位置完全相同,这是训练损失的颠簸。我们提出了六种互补的证据,认为诱导头可能是任何大小的变压器模型中一般性内部学习的机理来源。对于仅关注的小型模型,我们提供了有力的因果证据。对于具有MLP的较大模型,我们提供相关证据。
translated by 谷歌翻译